So welcome everyone to CS231n. I'm super excited to offer this class again for the third time. It seems that every time we offer this class, it's growing exponentially, unlike most things in the world. This is the third time we're teaching this class. The first time we had 150 students. Last year we had 350 students, so it doubled. This year we've doubled again, to about 730 students when I checked this morning. So to anyone who was not able to fit into the lecture hall, I apologize. But the videos will be up on the SCPD website within about two hours, so if you weren't able to come today, you can still check it out within a couple of hours.

So this class, CS231n, is really about computer vision. And what is computer vision? Computer vision is really the study of visual data. Since there are so many people enrolled in this class, I think I probably don't need to convince you that this is an important problem, but I'm still going to try to do that anyway.

The amount of visual data in our world has really exploded to a ridiculous degree in the last couple of years, and this is largely a result of the large number of sensors in the world. Probably most of us in this room are carrying around smartphones, and each smartphone has one, two, or maybe even three cameras on it. So I think on average there are even more cameras in the world than there are people. And as a result of all of these sensors, there's just a massive amount of visual data being produced out there in the world each day.

One statistic that I really like for putting this in perspective is a 2015 study from Cisco that estimated that by 2017, which is where we are now, roughly 80% of all traffic on the internet would be video. This is not even counting all the images and other types of visual data on the web. Just from a pure number-of-bits perspective, the majority of bits flying around the internet are actually visual data.
So it's really critical that we develop algorithms that can utilize and understand this data. However, there's a problem with visual data, and that's that it's really hard to understand. Sometimes we call visual data the dark matter of the internet, in analogy with dark matter in physics. So for those of you who have heard of this in physics before, dark matter accounts for some astonishingly large fraction of the mass in the universe, and we know about it due to its gravitational pull on various celestial bodies and whatnot, but we can't directly observe it. And visual data on the internet is much the same: it comprises the majority of bits flying around the internet, but it's very difficult for algorithms to actually go in and understand what exactly is comprising all the visual data on the web.

Another statistic that I like is that of YouTube. For roughly every second of clock time that passes in the world, there's something like five hours of video being uploaded to YouTube. So if we just sit here and count, one, two, three, now there are 15 more hours of video on YouTube. Google has a lot of employees, but there's no way that they could ever have employees sit down and watch and understand and annotate every video. So if they want to catalog and serve you relevant videos, and maybe monetize by putting ads on those videos, it's really crucial that we develop technologies that can dive in and automatically understand the content of visual data.

So this field of computer vision is truly an interdisciplinary field, and it touches on many different areas of science and engineering and technology. So obviously computer vision is the center of the universe, but there's a constellation of fields around computer vision: we touch on areas like physics, because we need to understand optics and image formation and how images are actually physically formed. We need to understand biology and psychology to understand how animal brains physically see and process visual information.
We of course draw a lot on computer science, mathematics, and engineering as we actually strive to build computer systems that implement our computer vision algorithms.

So a little bit more about where I'm coming from, and about where the teaching staff of this course is coming from. My co-instructor Serena and I are both PhD students in the Stanford Vision Lab, which is headed by Professor Fei-Fei Li, and our lab really focuses on machine learning and the computer science side of things. I work a little bit more on language and vision; I've done some projects in that. And other folks in our group have worked a little bit on the neuroscience and cognitive science side of things.

As a bit of introduction, you might be curious about how this course relates to other courses at Stanford. We kind of assume a basic introductory understanding of computer vision. So if you're an undergrad and you've never seen computer vision before, maybe you should have taken CS131, which was offered earlier this year by Fei-Fei and Juan Carlos Niebles. There was also a course taught last quarter by Professor Chris Manning and Richard Socher about the intersection of deep learning and natural language processing, and I imagine a number of you may have taken that course last quarter. There'll be some overlap between this course and that, but we're really focusing on the computer vision side of things, and grounding all of our motivation in computer vision.

Also taught concurrently this quarter is CS231a, taught by Professor Silvio Savarese. CS231a is a more all-encompassing computer vision course. It focuses on things like 3D reconstruction, matching, and robotic vision, and it's a bit more all-encompassing with regard to vision than our course.
And this course, CS231n, really focuses on a particular class of algorithms revolving around neural networks, especially convolutional neural networks and their applications to various visual recognition tasks. Of course, there are also a number of seminar courses that are taught, and you'll have to check the syllabus and course schedule for more details on those, 'cause they vary a bit each year.

So this lecture is normally given by Professor Fei-Fei Li. Unfortunately, she wasn't able to be here today, so instead, for the majority of the lecture, we're going to tag team a little bit. She actually recorded a bit of pre-recorded audio describing to you the history of computer vision, because this class is a computer vision course, and it's very critical and important that you understand the history and the context of all the existing work that led us to these developments of convolutional neural networks as we know them today. I'll let virtual Fei-Fei take over [laughing] and give you a brief introduction to the history of computer vision.

Okay, let's start with today's agenda. We have two topics to cover: one is a brief history of computer vision, and the other is the overview of our course, CS231n. So we'll start with a very brief history of where vision comes from, when computer vision started, and where we are today. The history of vision goes back many, many years, in fact about 543 million years. What was life like during that time? Well, the earth was mostly water, there were a few species of animals floating around in the ocean, and life was very chill. Animals didn't move around much; they didn't have eyes or anything. When food swam by, they grabbed it; if the food didn't swim by, they just floated around. But something really remarkable happened around 540 million years ago. From fossil studies, zoologists found that within a very short period of time, ten million years, the number of animal species just exploded. It went from a few of them to hundreds of thousands, and that was strange. What caused this?
There were many theories, but for many years it was a mystery; evolutionary biologists call this evolution's Big Bang. A few years ago, an Australian zoologist called Andrew Parker proposed one of the most convincing theories: from the studies of fossils, he discovered that around 540 million years ago the first animals developed eyes, and the onset of vision started this explosive speciation phase. Animals could suddenly see, and once you can see, life becomes much more proactive. Some predators went after prey, and prey had to escape from predators, so the onset of vision started an evolutionary arms race, and animals had to evolve quickly in order to survive as a species. So that was the beginning of vision in animals. After 540 million years, vision has developed into the biggest sensory system of almost all animals, especially intelligent animals. In humans, almost 50% of the neurons in our cortex are involved in visual processing. It is the biggest sensory system, and it enables us to survive, work, move around, manipulate things, communicate, entertain, and many other things. Vision is really important for animals, and especially for intelligent animals.

So that was a quick story of biological vision. What about humans, and the history of humans making mechanical vision, or cameras? Well, one of the early cameras that we know of today is from the 1600s, the Renaissance period: the camera obscura. This is a camera based on pinhole camera theories. It's very similar to the early eyes that animals developed, with a hole that collects light and a plane in the back of the camera that collects the information and projects the imagery. As cameras evolved, today we have cameras everywhere; the camera is one of the most popular sensors people use, from smartphones to many other devices.

In the meantime, biologists started studying the mechanism of vision. One of the most influential works in both human and animal vision, and one that inspired computer vision, is the work done by Hubel and Wiesel in the '50s and '60s using electrophysiology. The question they were asking was: what is the visual processing mechanism like in primates, in mammals? So they chose to study the cat brain, which is more or less similar to the human brain from a visual processing point of view.
What they did was stick some electrodes in the back of the cat brain, which is where the primary visual cortex area is, and then look at what stimuli make the neurons in the primary visual cortex of the cat brain respond excitedly. What they learned is that there are many types of cells in the primary visual cortex part of the cat brain, but one of the most important is the simple cells: they respond to oriented edges when they move in certain directions. Of course, there are also more complex cells, but by and large what they discovered is that visual processing starts with simple structures of the visual world, oriented edges, and as information moves along the visual processing pathway, the brain builds up the complexity of the visual information until it can recognize the complex visual world.

The history of computer vision also starts around the early '60s. Block World is a set of work published by Larry Roberts, which is widely known as one of the first, probably the first, PhD theses of computer vision, where the visual world was simplified into simple geometric shapes, and the goal was to be able to recognize them and reconstruct what these shapes are. In 1966 there was a now-famous MIT summer project called "The Summer Vision Project." The goal of this Summer Vision Project, I read, "is an attempt to use our summer workers effectively in a construction of a significant part of a visual system." So the goal was that in one summer we were going to work out the bulk of the visual system. That was an ambitious goal. Fifty years have passed; the field of computer vision has blossomed from one summer project into a field of thousands of researchers worldwide, still working on some of the most fundamental problems of vision. We still have not solved vision, but it has grown into one of the most important and fastest growing areas of artificial intelligence.

Another person that we should pay tribute to is David Marr. David Marr was an MIT vision scientist, and he wrote an influential book in the late '70s about what he thinks vision is and how we should go about computer vision and developing algorithms that can enable computers to recognize the visual world. The thought process in David Marr's book is that in order to take an image and arrive at a final holistic, full 3D representation of the visual world, we have to go through several processes.
The first process is what he calls the "primal sketch"; this is where mostly the edges, the bars, the ends, the virtual lines, the curves, and the boundaries are represented, and this is very much inspired by what neuroscientists had seen: Hubel and Wiesel told us the early stage of visual processing has a lot to do with simple structures like edges. The next step after the edges and the curves is what David Marr calls the "two-and-a-half-D sketch"; this is where we start to piece together the surfaces, the depth information, the layers, and the discontinuities of the visual scene. Then eventually we put everything together and have a 3D model, hierarchically organized in terms of surface and volumetric primitives and so on. So that was a very idealized thought process of what vision is, and this way of thinking actually dominated computer vision for several decades; it is also a very intuitive way for students to enter the field of vision and think about how we can deconstruct visual information.

Another very important, seminal group of work happened in the '70s, where people began to ask the question: how can we move beyond the simple block world and start recognizing or representing real-world objects? Think about the '70s: there was very little data available, computers were extremely slow, and PCs were not even around, but computer scientists were starting to think about how we can recognize and represent objects. So in Palo Alto, both at Stanford and at SRI, two groups of scientists proposed similar ideas: one is called "generalized cylinder," the other is called "pictorial structure." The basic idea is that every object is composed of simple geometric primitives; for example, a person can be pieced together from generalized cylindrical shapes, or a person can be pieced together from critical parts and the elastic distances between these parts. Either representation is a way to reduce the complex structure of the object into a collection of simpler shapes and their geometric configuration. These works were influential for quite a few years.

Then in the '80s came another example of thinking about how to reconstruct or recognize the visual world from simple structures: work by David Lowe, in which he tries to recognize razors by constructing lines and edges, mostly straight lines, and their combinations.
So there was a lot of effort in trying to think about what the tasks in computer vision were in the '60s, '70s, and '80s, and frankly it was very hard to solve the problem of object recognition. Everything I've shown you so far was a very audacious, ambitious attempt, but these attempts remained at the level of toy examples, or just a few examples. Not a lot of progress had been made in terms of delivering something that could work in the real world.

So as people thought about the problems involved in solving vision, one important question came around: if object recognition is too hard, maybe we should first do object segmentation, that is, the task of taking an image and grouping the pixels into meaningful areas. We might not know that the pixels grouped together are called a person, but we can extract all the pixels that belong to the person from the background; that is called image segmentation. Here's one very early seminal work by Jitendra Malik and his student Jianbo Shi from Berkeley, using a graph theory algorithm for the problem of image segmentation.

Here's another problem that made headway ahead of many other problems in computer vision: face detection. Faces are one of the most important objects to humans, probably the most important. Around 1999 to 2000, machine learning techniques, especially statistical machine learning techniques, started to gain momentum. These are techniques such as support vector machines, boosting, and graphical models, including the first wave of neural networks. One particular work that made a lot of contributions used the AdaBoost algorithm to do real-time face detection, by Paul Viola and Michael Jones, and there's a lot to admire in this work. It was done in 2001, when computer chips were still very, very slow, but they were able to do face detection in images in near real time, and within five years of the publication of this paper, in 2006, Fujifilm rolled out the first digital camera with a real-time face detector built in. So it was a very rapid transfer from basic science research to real-world application.

As a field, we continued to explore how we could do object recognition better. One of the very influential ways of thinking, from the late '90s through the first ten years of the 2000s, is feature-based object recognition, and here is a seminal work by David Lowe called the SIFT feature.
The idea is that matching an entire object, for example this stop sign, to another stop sign is very difficult, because there might be all kinds of changes due to camera angle, occlusion, viewpoint, lighting, and just the intrinsic variation of the object itself. But it was an inspired observation that some parts of the object, some features, tend to remain diagnostic and invariant to changes. So the task of object recognition begins with identifying these critical features on the object and then matching the features to a similar object; that's an easier task than pattern matching the entire object. Here is a figure from his paper, showing that several dozen SIFT features from one stop sign are identified and matched to the SIFT features of another stop sign.

Using the same building block, diagnostic features in images, the field made another step forward and started recognizing holistic scenes. Here is an example algorithm called Spatial Pyramid Matching. The idea is that there are features in the images that can give us clues about which type of scene it is, whether it's a landscape or a kitchen or a highway and so on, and this particular work takes these features from different parts of the image, at different resolutions, puts them together in a feature descriptor, and then runs a support vector machine algorithm on top of that. Very similar work gained momentum in human recognition: putting together these features, we have a number of works that look at how we can compose human bodies in more realistic images and recognize them. One work is called the "histogram of oriented gradients," another is called "deformable part models."

So as you can see, as we move from the '60s, '70s, and '80s towards the first decade of the 21st century, one thing was changing, and that's the quality of the pictures. With the growth of the Internet and of digital cameras, we were getting better and better data to study computer vision. So one of the outcomes in the early 2000s is that the field of computer vision had defined a very important building-block problem to solve. It's not the only problem to solve, but in terms of recognition this is a very important one, which is object recognition.
I have talked about object recognition all along, but in the early 2000s we began to have benchmark datasets that enabled us to measure the progress of object recognition. One of the most influential benchmark datasets is called the PASCAL Visual Object Challenge. It's a dataset composed of 20 object classes, three of which are shown here: train, airplane, person; I think it also has cows, bottles, cats, and so on. The dataset is composed of several thousand to ten thousand images per category, and different groups in the field developed algorithms to test against the test set and see how we were making progress. Here is a figure showing, from the year 2007 to the year 2012, that the performance of detecting the 20 object classes in the benchmark dataset steadily increased. So there was a lot of progress made.

Around that time, a group of us, from Princeton and Stanford, also began to ask a harder question of ourselves as well as of our field: are we ready to recognize every object, or most of the objects, in the world? This was also motivated by an observation rooted in machine learning, which is that most machine learning algorithms, no matter whether it's a graphical model, a support vector machine, or AdaBoost, are very likely to overfit in the training process. Part of the problem is that visual data is very complex; because it's complex, our models tend to have a high-dimensional input and a lot of parameters to fit, and when we don't have enough training data, overfitting happens very fast and then we cannot generalize very well. So motivated by this dual reason, one being that we just wanted to recognize the world of all the objects, the other being to overcome the machine learning bottleneck of overfitting, we began this project called ImageNet. We wanted to put together the largest possible dataset of all the pictures we could find, the world of objects, and use it for training as well as for benchmarking. It was a project that took us about three years of hard work. It basically began with downloading billions of images from the internet, organized by the dictionary called WordNet, which has tens of thousands of object classes, and then we had to use some clever crowd engineering, a method using the Amazon Mechanical Turk platform, to sort, clean, and label each of the images.
The end result is ImageNet: almost 15 million, or 40 million plus, images organized into twenty-two thousand categories of objects and scenes. This was gigantic, probably the biggest dataset produced in the field of AI at that time, and it began to push the algorithm development of object recognition into another phase. Especially important is how to benchmark the progress. So starting in 2009, the ImageNet team rolled out an international challenge called the ImageNet Large-Scale Visual Recognition Challenge. For this challenge we put together a more stringent test set of 1.4 million objects across 1,000 object classes, and this is used to test the image classification results of computer vision algorithms. Here's an example picture: if an algorithm can output five labels, and the top five labels include the correct object in this picture, then we call it a success.

So here is a results summary of the ImageNet Challenge, of the image classification results from 2010 to 2015. On the x axis you see the years, and on the y axis you see the error rate. The good news is that the error rate steadily decreased, to the point that by 2015 the error rate was so low that it was on par with what humans can do. And by a human here, I mean a single Stanford PhD student who spent weeks doing this task as if he were a computer participating in the ImageNet Challenge. So that's a lot of progress made. Even though we have not solved all the problems of object recognition, which you'll learn about in this class, going from an error rate that was unacceptable for real-world applications all the way to being on par with humans on the ImageNet challenge took the field only a few years. And one particular moment you should notice on this graph is the year 2012.
In the first two years, the error rate hovered around 25 percent, but in 2012 the error rate dropped almost 10 percentage points, to 16 percent. Even though it's better now, that drop was very significant, and the winning algorithm of that year was a convolutional neural network model that beat all other algorithms at the time to win the ImageNet challenge. And this is the focus of our whole course this quarter: to take a deep dive into what convolutional neural network models are. Another, now more popular, name for this is deep learning. We'll look at what these models are, what the principles are, what the good practices are, and what the recent progress of these models is. But here is where history was made: around 2012, convolutional neural network models, or deep learning models, showed tremendous capacity and ability to make good progress in the field of computer vision, along with several sister fields like natural language processing and speech recognition. So without further ado, I'm going to hand the rest of the lecture over to Justin to talk about the overview of CS231n.

Alright, thanks so much Fei-Fei. I'll take it over from here. So now I want to shift gears a little bit and talk a little bit more about this class, CS231n. The primary focus of this class is the image classification problem, which we previewed a little bit in the context of the ImageNet Challenge. In image classification, again, the setup is that your algorithm looks at an image and then picks from among some fixed set of categories to classify that image. This might seem like somewhat of a restrictive or artificial setup, but it's actually quite general, and this problem can be applied in many different settings, both in industry and academia and many other places. For example, you could apply this to recognizing food, or recognizing calories in food, or recognizing different artworks or different products out in the world. So this relatively basic tool of image classification is super useful on its own and could be applied all over the place for many different applications.
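To make the classification setup, and the top-five scoring rule from the ImageNet discussion, concrete, here is a minimal sketch in Python with NumPy. This is illustrative code rather than anything from the course materials, and the array names and shapes are assumptions. A classifier boils down to a function from an image to a vector of scores, one per category, and under the top-k rule a prediction counts as correct if the true label is among the k highest-scoring classes:

    import numpy as np

    def topk_accuracy(scores, labels, k=5):
        """scores: (N, C) class scores for N images; labels: (N,) true class indices."""
        topk = np.argsort(-scores, axis=1)[:, :k]     # the k highest-scoring classes per image
        hits = (topk == labels[:, None]).any(axis=1)  # is the true label among them?
        return hits.mean()

    # Toy check: 4 images, 10 candidate categories.
    scores = np.random.randn(4, 10)                   # stand-in for a real model's outputs
    labels = np.array([3, 1, 7, 0])
    print("top-1 accuracy:", topk_accuracy(scores, labels, k=1))
    print("top-5 accuracy:", topk_accuracy(scores, labels, k=5))

With k=1 this is ordinary classification accuracy; the ILSVRC headline number is the top-5 error, which is one minus this quantity at k=5.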
But in this course, we're also going to talk about several other visual recognition problems that build upon many of the tools we develop for the purpose of image classification. We'll talk about problems such as object detection and image captioning. The setup in object detection is a little bit different: rather than classifying an entire image as a cat or a dog or a horse or whatnot, we instead want to go in and draw bounding boxes and say that there is a dog here, and a cat here, and a car over in the background, drawing these boxes that describe where objects are in the image. We'll also talk about image captioning, where, given an image, the system needs to produce a natural language sentence describing the image. It sounds like a really hard, complicated, and different problem, but we'll see that many of the tools we develop in service of image classification will be reused in these other problems as well.
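As a side note on what a detector's output actually looks like, here is a minimal sketch using an off-the-shelf model from torchvision. The choice of library and model is an assumption made for illustration, not something the lecture prescribes. The point is the data structure: a detector maps an image to a set of boxes, each with a class label and a confidence score:

    import torch
    import torchvision

    # An off-the-shelf detector (downloads weights pretrained on COCO).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    image = torch.rand(3, 480, 640)  # stand-in for a real RGB image tensor
    with torch.no_grad():
        (pred,) = model([image])     # one prediction dict per input image

    # pred["boxes"] is (N, 4) in (x1, y1, x2, y2) pixel coordinates;
    # pred["labels"] and pred["scores"] give a category and a confidence per box.
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if score > 0.8:              # keep only confident detections
            print(label.item(), box.tolist())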
So we mentioned this before in the context of the ImageNet Challenge, but one of the things that has really driven the progress of the field in recent years has been the adoption of convolutional neural networks, or CNNs, sometimes called convnets. If we look at the algorithms that have won the ImageNet Challenge over the last several years, in 2011 we see this method from Lin et al, which is still hierarchical: it consists of multiple layers. First we compute some features, next we compute some local invariances and some pooling, we go through several layers of processing, and then we finally feed the resulting descriptor to a linear SVM. What you'll notice is that this is still hierarchical: we're still detecting edges, we still have notions of invariance, and many of these intuitions will carry over into convnets.

But the breakthrough moment was really in 2012, when Geoff Hinton's group in Toronto, together with Alex Krizhevsky and Ilya Sutskever, who were his PhD students at that time, created this seven-layer convolutional neural network, now known as AlexNet, then called SuperVision, which just did very, very well in the ImageNet competition in 2012. Since then, every year the winner of ImageNet has been a neural network, and the trend has been that these networks are getting deeper and deeper each year. AlexNet was a seven- or eight-layer neural network, depending on how exactly you count things. In 2014 we had these much deeper networks: GoogLeNet from Google, and the VGG network from Oxford, which was about 19 layers at that time. Then in 2015 it got really crazy, and this paper came out from Microsoft Research Asia called Residual Networks, which were 152 layers at that time. And since then, it turns out you can get a little bit better if you go up to 200 layers, but you run out of memory on your GPUs. We'll get into all of that later, but the main takeaway here is that convolutional neural networks really had this breakthrough moment in 2012, and since then there's been a lot of effort focused on tuning and tweaking these algorithms to make them perform better and better on this problem of image classification. Throughout the rest of the quarter, we're going to dive in deep, and you'll understand exactly how these different models work.

But one point is really important: it's true that the breakthrough moment for convolutional neural networks was in 2012, when these networks performed very well on the ImageNet Challenge, but they certainly weren't invented in 2012. These algorithms had actually been around for quite a long time before that. One of the foundational works in this area of convolutional neural networks came from the '90s, from Yann LeCun and collaborators, who at that time were at Bell Labs.
In 1998, they built this convolutional neural network for recognizing digits. They wanted to deploy it to automatically recognize handwritten checks or addresses for the post office. They built this convolutional neural network which could take in the pixels of an image and then classify what digit it was, or what letter it was, or whatnot. And the structure of this network actually looks pretty similar to the AlexNet architecture that was used in 2012. Here we see that we're taking in these raw pixels, and we have many layers of convolution and sub-sampling, together with the so-called fully connected layers, all of which will be explained in much more detail later in the course. But if you just look at these two pictures, they look pretty similar, and the 2012 architecture shares a lot of architectural similarities with this network going back to the '90s.
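To make that side-by-side comparison concrete, here is a minimal LeNet-style network sketched in PyTorch. The framework choice is an assumption made for illustration, and the layer sizes follow the classic LeNet-5 layout for 32x32 grayscale digit images rather than reproducing LeCun's system exactly: a stack of convolution and sub-sampling (pooling) layers, followed by fully connected layers that output one score per class.

    import torch
    import torch.nn as nn

    class LeNetStyle(nn.Module):
        """Convolution + sub-sampling stages, then fully connected layers."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
                nn.Tanh(),
                nn.AvgPool2d(2),                  # "sub-sampling" -> 6x14x14
                nn.Conv2d(6, 16, kernel_size=5),  # -> 16x10x10
                nn.Tanh(),
                nn.AvgPool2d(2),                  # -> 16x5x5
            )
            self.classifier = nn.Sequential(
                nn.Linear(16 * 5 * 5, 120),
                nn.Tanh(),
                nn.Linear(120, 84),
                nn.Tanh(),
                nn.Linear(84, num_classes),       # scores over the digit classes
            )

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    scores = LeNetStyle()(torch.randn(1, 1, 32, 32))  # one grayscale 32x32 image
    print(scores.shape)                               # torch.Size([1, 10])

AlexNet is essentially this same convolve, sub-sample, then fully-connect pattern scaled up, with more layers and channels, ReLU nonlinearities, and GPU training.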
So then the question you might ask is: if these algorithms had been around since the '90s, why have they only suddenly become popular in the last couple of years? There are a couple of really key innovations that have changed since the '90s. One is computation. Thanks to Moore's law, we've gotten faster and faster computers every year. This is kind of a coarse measure, but if you just look at the number of transistors on chips, that has grown by several orders of magnitude between the '90s and today. We've also had the advent of graphics processing units, or GPUs, which are super parallelizable and ended up being a perfect tool for crunching these computationally intensive convolutional neural network models. Just by having more compute available, researchers were able to explore larger architectures and larger models, and in some cases just increasing the model size, while still using these kinds of classical approaches and algorithms, tends to work quite well. So this idea of increasing computation is super important in the history of deep learning.

I think the second key innovation that changed between the '90s and now is data. These algorithms are very hungry for data: you need to feed them a lot of labeled images and labeled pixels for them to eventually work quite well. And in the '90s there just wasn't that much labeled data available. This was, again, before tools like Mechanical Turk, before the internet was super widely used, and it was very difficult to collect large, varied datasets. But now, in the 2010s, with datasets like PASCAL and ImageNet, there exist these relatively large, high-quality labeled datasets that are orders and orders of magnitude bigger than the datasets available in the '90s. These much larger datasets allowed us to work with higher-capacity models and train them to actually work quite well on real-world problems. But the critical takeaway here is that although convolutional neural networks might seem like some fancy new thing that has only popped up in the last couple of years, that's really not the case; this class of algorithms has existed for quite a long time in its own right.

Another thing I'd like to point out: in computer vision, we're in the business of trying to build machines that can see like people, and people can actually do a lot of amazing things with their visual systems. When you go around the world, you do a lot more than just draw boxes around objects and classify things as cats or dogs. Your visual system is much more powerful than that.
As we move forward in the field, I think there are still a ton of open challenges and open problems that we need to address, and we need to continue to develop our algorithms to do even better and tackle even more ambitious problems. Some examples of this go back to older ideas, in fact: things like semantic segmentation or perceptual grouping, where rather than labeling the entire image, we want to understand, for every pixel in the image, what it is doing and what it means. We'll revisit that idea a little bit later in the course. There's definitely work going back to this idea of 3D understanding, of reconstructing the entire world, and that's still an unsolved problem, I think. And there are just tons and tons of other tasks that you can imagine. For example, activity recognition: if I'm given a video of some person doing some activity, what's the best way to recognize that activity? That's quite a challenging problem as well. And then as we move forward with things like augmented reality and virtual reality, and as new technologies and new types of sensors become available, I think we'll come up with a lot of new, interesting, hard, and challenging problems to tackle as a field.

Here is an example from some of my own work in the vision lab, on a dataset called Visual Genome. The idea is that we're trying to capture some of these intricacies of the real world. Rather than describing just boxes, maybe we should be describing images as whole, large graphs of semantically related concepts, encompassing not just object identities but also object relationships, object attributes, and actions that are occurring in the scene. This type of representation might allow us to capture some of the richness of the visual world that's left on the table when we're using simple classification.
649 00:46:12,889 --> 00:46:15,270 This is by no means a standard approach at this point, 650 00:46:15,270 --> 00:46:17,330 but it's just kind of giving you the sense 651 00:46:17,330 --> 00:46:19,635 that there's so much more that your visual system can do 652 00:46:19,635 --> 00:46:22,590 that is maybe not captured in this vanilla 653 00:46:22,590 --> 00:46:24,840 image classification setup. 654 00:46:28,003 --> 00:46:29,744 I think another really interesting work 655 00:46:29,744 --> 00:46:31,592 that kind of points in this direction 656 00:46:31,592 --> 00:46:34,145 actually comes from Fei-Fei's grad school days 657 00:46:34,145 --> 00:46:36,843 when she was doing her PhD at Caltech 658 00:46:36,843 --> 00:46:38,952 with her advisors there. 659 00:46:38,952 --> 00:46:41,692 In this setup, they took people 660 00:46:41,692 --> 00:46:44,604 and showed them this image for just half a second. 661 00:46:44,604 --> 00:46:46,302 So they flashed this image in front of them 662 00:46:46,302 --> 00:46:47,896 for just a very short period of time, 663 00:46:47,896 --> 00:46:50,169 and even with this very, very rapid exposure 664 00:46:50,169 --> 00:46:52,108 to an image, people were able to write 665 00:46:52,108 --> 00:46:54,033 these long, descriptive paragraphs 666 00:46:54,033 --> 00:46:56,473 giving a whole story of the image. 667 00:46:56,473 --> 00:47:00,284 And, this is quite remarkable if you think about it: 668 00:47:00,284 --> 00:47:03,692 after just half a second of looking at this image, 669 00:47:03,692 --> 00:47:05,560 a person was able to say that this is 670 00:47:05,560 --> 00:47:08,481 some kind of a game or fight between two groups of men, 671 00:47:08,481 --> 00:47:10,375 that the man on the left is throwing something, 672 00:47:10,375 --> 00:47:13,134 that it's outdoors because there seems to be an impression of grass, 673 00:47:13,134 --> 00:47:14,576 and so on and so on. 674 00:47:14,576 --> 00:47:16,016 And, you can imagine that if a person 675 00:47:16,016 --> 00:47:17,617 were to look even longer at this image, 676 00:47:17,617 --> 00:47:19,169 they could probably write a whole novel 677 00:47:19,169 --> 00:47:20,942 about who these people are and why they are 678 00:47:20,942 --> 00:47:22,307 in this field playing this game. 679 00:47:22,307 --> 00:47:23,685 They could go on and on and on, 680 00:47:23,685 --> 00:47:25,613 roping in things from their external knowledge 681 00:47:25,613 --> 00:47:27,187 and their prior experience. 682 00:47:27,187 --> 00:47:30,297 This is in some sense the holy grail of computer vision: 683 00:47:30,297 --> 00:47:32,659 to understand the story of an image 684 00:47:32,659 --> 00:47:34,663 in a very rich and deep way. 685 00:47:34,663 --> 00:47:36,932 And, I think that despite the massive progress 686 00:47:36,932 --> 00:47:39,706 in the field that we've had over the past several years, 687 00:47:39,706 --> 00:47:44,460 we're still quite a long way from achieving this holy grail. 688 00:47:44,460 --> 00:47:46,563 Another image that I think really exemplifies 689 00:47:46,563 --> 00:47:50,472 this idea comes, again, from Andrej Karpathy's blog. 690 00:47:50,472 --> 00:47:52,890 It's this amazing image. 691 00:47:52,890 --> 00:47:54,391 Many of you smiled, many of you laughed. 692 00:47:54,391 --> 00:47:56,212 I think this is a pretty funny image. 693 00:47:56,212 --> 00:47:57,696 But, why is it a funny image?
694 00:47:57,696 --> 00:47:59,895 Well, we've got a man standing on a scale, 695 00:47:59,895 --> 00:48:01,607 and we know that people are sometimes self-conscious 696 00:48:01,607 --> 00:48:04,380 about their weight, and scales measure weight. 697 00:48:04,380 --> 00:48:06,899 Then we've got this other guy behind him 698 00:48:06,899 --> 00:48:08,791 pushing his foot down on the scale, 699 00:48:08,791 --> 00:48:10,900 and we know, because of the way scales work, 700 00:48:10,900 --> 00:48:12,958 that this will cause an inflated reading 701 00:48:12,958 --> 00:48:13,867 on the scale. 702 00:48:13,867 --> 00:48:14,895 But, there's more. 703 00:48:14,895 --> 00:48:16,819 We know that this person is not just any person. 704 00:48:16,819 --> 00:48:19,500 This is actually Barack Obama, who was at the time 705 00:48:19,500 --> 00:48:20,905 President of the United States, 706 00:48:20,905 --> 00:48:22,541 and we know that Presidents of the United States 707 00:48:22,541 --> 00:48:24,741 are supposed to be respectable politicians who are 708 00:48:24,741 --> 00:48:27,045 [laughing] 709 00:48:27,045 --> 00:48:29,154 probably not supposed to be playing jokes 710 00:48:29,154 --> 00:48:31,304 on their compatriots in this way. 711 00:48:31,304 --> 00:48:32,713 We know that there are these people 712 00:48:32,713 --> 00:48:34,564 in the background who are laughing and smiling, 713 00:48:34,564 --> 00:48:36,066 and we know that that means they're 714 00:48:36,066 --> 00:48:37,912 understanding something about the scene. 715 00:48:37,912 --> 00:48:39,597 We have some understanding that they know 716 00:48:39,597 --> 00:48:41,575 that President Obama is this respectable guy 717 00:48:41,575 --> 00:48:42,866 who's playing a joke on this other guy. 718 00:48:42,866 --> 00:48:43,767 Like, this is crazy. 719 00:48:43,767 --> 00:48:45,830 There's so much going on in this image. 720 00:48:45,830 --> 00:48:48,167 And, our computer vision algorithms today 721 00:48:48,167 --> 00:48:51,108 are actually, I think, a long way from this true, 722 00:48:51,108 --> 00:48:53,002 deep understanding of images. 723 00:48:53,002 --> 00:48:56,032 So I think that despite the massive progress 724 00:48:56,032 --> 00:48:58,777 in the field, we really have a long way to go. 725 00:48:58,777 --> 00:49:01,385 To me, that's really exciting as a researcher 726 00:49:01,385 --> 00:49:02,630 'cause I think that we'll have 727 00:49:02,630 --> 00:49:04,611 just a lot of really exciting, cool problems 728 00:49:04,611 --> 00:49:06,694 to tackle moving forward. 729 00:49:07,913 --> 00:49:10,202 So I hope at this point I've done a relatively good job 730 00:49:10,202 --> 00:49:13,054 of convincing you that computer vision is really interesting. 731 00:49:13,054 --> 00:49:14,208 It's really exciting. 732 00:49:14,208 --> 00:49:16,329 It can be very useful. 733 00:49:16,329 --> 00:49:18,315 It can go out and make the world a better place 734 00:49:18,315 --> 00:49:20,043 in various ways. 735 00:49:20,043 --> 00:49:21,591 Computer vision can be applied 736 00:49:21,591 --> 00:49:24,559 in areas like medical diagnosis and self-driving cars 737 00:49:24,559 --> 00:49:28,134 and robotics and all these different places, 738 00:49:28,134 --> 00:49:30,713 in addition to tying back to this core 739 00:49:30,713 --> 00:49:33,120 idea of understanding human intelligence.
740 00:49:33,120 --> 00:49:34,849 So to me, computer vision 741 00:49:34,849 --> 00:49:37,141 is this fantastically amazing, interesting field, 742 00:49:37,141 --> 00:49:38,775 and I'm really glad that over the course 743 00:49:38,775 --> 00:49:40,475 of the quarter, we'll get to really dive in 744 00:49:40,475 --> 00:49:42,337 and dig into all the different details 745 00:49:42,337 --> 00:49:46,234 of how these algorithms work these days. 746 00:49:46,234 --> 00:49:48,949 That's sort of my pitch about computer vision 747 00:49:48,949 --> 00:49:50,673 and about the history of computer vision. 748 00:49:50,673 --> 00:49:52,283 I don't know if there are any questions about this 749 00:49:52,283 --> 00:49:53,366 at this time. 750 00:49:55,707 --> 00:49:57,055 Okay. 751 00:49:57,055 --> 00:49:58,345 So then I want to talk a little bit more 752 00:49:58,345 --> 00:50:00,408 about the logistics of this class 753 00:50:00,408 --> 00:50:02,408 for the rest of the quarter. 754 00:50:02,408 --> 00:50:04,382 So you might ask, who are we? 755 00:50:04,382 --> 00:50:06,904 So this class is taught by Fei-Fei Li, 756 00:50:06,904 --> 00:50:11,271 who is a professor of computer science here at Stanford, 757 00:50:11,271 --> 00:50:14,516 who's my advisor, and who directs the Stanford Vision Lab 758 00:50:14,516 --> 00:50:16,852 and also the Stanford AI Lab. 759 00:50:16,852 --> 00:50:20,081 The other two instructors are me, Justin Johnson, 760 00:50:20,081 --> 00:50:22,519 and Serena Yeung, who is up here in the front. 761 00:50:22,519 --> 00:50:25,219 We're both PhD students working under Fei-Fei 762 00:50:25,219 --> 00:50:27,379 on various computer vision problems. 763 00:50:27,379 --> 00:50:29,996 We have an amazing teaching staff this year 764 00:50:29,996 --> 00:50:31,920 of 18 TAs so far, 765 00:50:31,920 --> 00:50:34,179 many of whom are sitting over here in the front. 766 00:50:34,179 --> 00:50:35,921 These guys are really the unsung heroes 767 00:50:35,921 --> 00:50:38,527 behind the scenes, making the course run smoothly 768 00:50:38,527 --> 00:50:40,320 and making sure everything happens well. 769 00:50:40,320 --> 00:50:42,365 So be nice to them. 770 00:50:42,365 --> 00:50:44,196 [laughing] 771 00:50:44,196 --> 00:50:47,153 I think I should also mention this is the third time 772 00:50:47,153 --> 00:50:49,216 we've taught this course, and it's the first time 773 00:50:49,216 --> 00:50:51,652 that Andrej Karpathy has not been an instructor 774 00:50:51,652 --> 00:50:53,050 in this course. 775 00:50:53,050 --> 00:50:56,192 He was a very close friend of mine. 776 00:50:56,192 --> 00:50:57,093 He's still alive. 777 00:50:57,093 --> 00:50:58,353 He's okay, don't worry. 778 00:50:58,353 --> 00:50:59,612 [laughing] 779 00:50:59,612 --> 00:51:02,780 But, he graduated, so he's actually here, 780 00:51:02,780 --> 00:51:05,724 I think, hanging around in the lecture hall. 781 00:51:05,724 --> 00:51:07,662 A lot of the development and the history of this course 782 00:51:07,662 --> 00:51:09,570 is really due to him working on it 783 00:51:09,570 --> 00:51:11,617 with me over the last couple of years. 784 00:51:11,617 --> 00:51:15,398 So I think you should be aware of that. 785 00:51:15,398 --> 00:51:18,194 Also about logistics: probably the best way 786 00:51:18,194 --> 00:51:20,904 of keeping in touch with the course staff 787 00:51:20,904 --> 00:51:22,209 is through Piazza. 788 00:51:22,209 --> 00:51:25,212 You should all go and sign up right now.
789 00:51:25,212 --> 00:51:27,597 Piazza is really our preferred method of communication 790 00:51:27,597 --> 00:51:30,353 between the class and the teaching staff. 791 00:51:30,353 --> 00:51:32,621 If you have questions that you're afraid 792 00:51:32,621 --> 00:51:34,313 of being embarrassed about asking 793 00:51:34,313 --> 00:51:36,067 in front of your classmates, go ahead 794 00:51:36,067 --> 00:51:38,602 and ask anonymously, or even post private questions 795 00:51:38,602 --> 00:51:40,572 directly to the teaching staff. 796 00:51:40,572 --> 00:51:42,269 So basically anything that you need 797 00:51:42,269 --> 00:51:44,452 should ideally go through Piazza. 798 00:51:44,452 --> 00:51:46,445 We also have a staff mailing list, 799 00:51:46,445 --> 00:51:48,422 but we ask that this be reserved 800 00:51:48,422 --> 00:51:51,302 mostly for personal, confidential things 801 00:51:51,302 --> 00:51:53,517 that you don't want going on Piazza. 802 00:51:53,517 --> 00:51:55,773 Or, if you have something that's super confidential, 803 00:51:55,773 --> 00:51:58,365 super personal, then feel free 804 00:51:58,365 --> 00:52:02,125 to directly email me or Fei-Fei or Serena about that. 805 00:52:02,125 --> 00:52:03,900 But, for the most part, most of your communication 806 00:52:03,900 --> 00:52:06,096 with the staff should be through Piazza. 807 00:52:06,096 --> 00:52:08,660 We also have an optional textbook this year. 808 00:52:08,660 --> 00:52:10,401 This is by no means required. 809 00:52:10,401 --> 00:52:12,616 You can go through the course totally fine without it. 810 00:52:12,616 --> 00:52:14,372 Everything will be self-contained. 811 00:52:14,372 --> 00:52:17,770 This is sort of exciting because it's maybe the first 812 00:52:17,770 --> 00:52:19,786 textbook about deep learning, which got published 813 00:52:19,786 --> 00:52:21,889 earlier this year by Ian Goodfellow, 814 00:52:21,889 --> 00:52:24,078 Yoshua Bengio, and Aaron Courville. 815 00:52:24,078 --> 00:52:26,684 I put the Amazon link here in the slides. 816 00:52:26,684 --> 00:52:28,197 You can get it if you want to, 817 00:52:28,197 --> 00:52:30,079 but the whole content of the book 818 00:52:30,079 --> 00:52:31,807 is also free online, so you don't even have to buy it 819 00:52:31,807 --> 00:52:32,943 if you don't want to. 820 00:52:32,943 --> 00:52:34,261 So again, this is totally optional, 821 00:52:34,261 --> 00:52:35,778 but we'll probably be posting some readings 822 00:52:35,778 --> 00:52:37,614 throughout the quarter that give you an additional 823 00:52:37,614 --> 00:52:40,614 perspective on some of the material. 824 00:52:41,697 --> 00:52:43,259 So our philosophy about this class 825 00:52:43,259 --> 00:52:47,035 is that you should really understand the deep mechanics 826 00:52:47,035 --> 00:52:48,794 of all of these algorithms. 827 00:52:48,794 --> 00:52:50,671 You should understand at a very deep level 828 00:52:50,671 --> 00:52:52,717 exactly how these algorithms work: 829 00:52:52,717 --> 00:52:54,295 what exactly is going on when you're 830 00:52:54,295 --> 00:52:56,097 stitching together these neural networks, 831 00:52:56,097 --> 00:52:58,128 and how these architectural decisions 832 00:52:58,128 --> 00:53:00,144 influence how the network is trained 833 00:53:00,144 --> 00:53:02,314 and tested, and so on.
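As a small taste of what "understanding the deep mechanics" means in practice, here is a minimal NumPy sketch of one fully-connected layer with an explicit forward pass and an explicit backward pass. This is written in the spirit of the assignments rather than copied from them; the function names and signatures are assumptions for illustration.

    import numpy as np

    def affine_forward(x, w, b):
        """Forward pass: out = x @ w + b. Cache inputs for the backward pass."""
        out = x.dot(w) + b
        cache = (x, w)
        return out, cache

    def affine_backward(dout, cache):
        """Backward pass: given the upstream gradient dout, return dx, dw, db."""
        x, w = cache
        dx = dout.dot(w.T)       # gradient w.r.t. the input
        dw = x.T.dot(dout)       # gradient w.r.t. the weights
        db = dout.sum(axis=0)    # gradient w.r.t. the bias
        return dx, dw, db

    # Toy check: a batch of 4 examples, 3 input features, 2 outputs.
    x = np.random.randn(4, 3)
    w = np.random.randn(3, 2)
    b = np.zeros(2)
    out, cache = affine_forward(x, w, b)
    dx, dw, db = affine_backward(np.ones_like(out), cache)

Stacking layers like this and chaining their backward passes is exactly the kind of mechanics the assignments walk through.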
834 00:53:02,314 --> 00:53:05,211 And, throughout the course, through the assignments, 835 00:53:05,211 --> 00:53:07,163 you'll be implementing your own convolutional 836 00:53:07,163 --> 00:53:08,757 neural networks from scratch in Python. 837 00:53:08,757 --> 00:53:11,560 You'll be implementing the full forward and backward 838 00:53:11,560 --> 00:53:13,260 passes through these things, and by the end, 839 00:53:13,260 --> 00:53:15,106 you'll have implemented a whole convolutional neural network 840 00:53:15,106 --> 00:53:16,320 totally on your own. 841 00:53:16,320 --> 00:53:18,320 I think that's really cool. 842 00:53:18,320 --> 00:53:20,569 But, we're also kind of practical, and we know 843 00:53:20,569 --> 00:53:23,520 that in most cases people are not writing these things 844 00:53:23,520 --> 00:53:25,613 from scratch, so we also want to give you 845 00:53:25,613 --> 00:53:27,769 a good introduction to some of the state-of-the-art 846 00:53:27,769 --> 00:53:31,326 software tools that are used in practice for these things. 847 00:53:31,326 --> 00:53:33,373 So we're going to talk about some of the state-of-the-art 848 00:53:33,373 --> 00:53:36,392 software packages like TensorFlow, Torch, PyTorch, 849 00:53:36,392 --> 00:53:37,663 and all these other things. 850 00:53:37,663 --> 00:53:39,890 And, I think you'll get some exposure 851 00:53:39,890 --> 00:53:42,636 to those on the homeworks and definitely through 852 00:53:42,636 --> 00:53:44,528 the course project as well. 853 00:53:44,528 --> 00:53:46,303 Another note about this course 854 00:53:46,303 --> 00:53:47,820 is that it's very state of the art. 855 00:53:47,820 --> 00:53:49,122 I think it's super exciting. 856 00:53:49,122 --> 00:53:50,715 This is a very fast-moving field. 857 00:53:50,715 --> 00:53:53,337 As you saw, even from those plots from the ImageNet challenge, 858 00:53:53,337 --> 00:53:55,611 there's basically been a ton of progress 859 00:53:55,611 --> 00:53:58,840 since 2012, and while I've been in grad school, 860 00:53:58,840 --> 00:54:00,538 the whole field has sort of been transforming every year. 861 00:54:00,538 --> 00:54:03,749 And, that's super exciting and super encouraging. 862 00:54:03,749 --> 00:54:07,177 But, what that means is that there's probably content 863 00:54:07,177 --> 00:54:09,132 that we'll cover this year that did not exist 864 00:54:09,132 --> 00:54:12,893 when this course was last taught a year ago. 865 00:54:12,893 --> 00:54:14,417 I think that's super exciting, and one 866 00:54:14,417 --> 00:54:16,629 of my favorite parts about teaching this course 867 00:54:16,629 --> 00:54:18,826 is roping in all this new, scientific, 868 00:54:18,826 --> 00:54:21,041 hot-off-the-presses stuff and being able 869 00:54:21,041 --> 00:54:24,041 to present it to you guys. 870 00:54:24,041 --> 00:54:26,071 We're also sort of about fun. 871 00:54:26,071 --> 00:54:27,770 So we're going to talk about some interesting, 872 00:54:27,770 --> 00:54:30,453 maybe not-so-serious topics as well this quarter, 873 00:54:30,453 --> 00:54:33,122 including image captioning, which is pretty fun, 874 00:54:33,122 --> 00:54:35,349 where we can write descriptions of images. 875 00:54:35,349 --> 00:54:37,177 But, we'll also cover some of these more artistic things 876 00:54:37,177 --> 00:54:39,896 like DeepDream, here on the left, 877 00:54:39,896 --> 00:54:42,261 where we can use neural networks to hallucinate 878 00:54:42,261 --> 00:54:44,277 these crazy, psychedelic images.
879 00:54:44,277 --> 00:54:45,975 And, by the end of the course, you'll know 880 00:54:45,975 --> 00:54:46,877 how that works. 881 00:54:46,877 --> 00:54:48,900 Or, on the right, this idea of style transfer, 882 00:54:48,900 --> 00:54:50,628 where we can take an image and render it 883 00:54:50,628 --> 00:54:54,507 in the style of famous artists like Picasso or Van Gogh 884 00:54:54,507 --> 00:54:55,340 or whatnot. 885 00:54:55,340 --> 00:54:56,654 And again, by the end of the quarter, 886 00:54:56,654 --> 00:54:59,654 you'll see how this stuff works. 887 00:54:59,654 --> 00:55:02,519 So the way the course works is we're going to have 888 00:55:02,519 --> 00:55:03,794 three problem sets. 889 00:55:03,794 --> 00:55:07,039 The first problem set will hopefully be out 890 00:55:07,039 --> 00:55:08,252 by the end of the week. 891 00:55:08,252 --> 00:55:10,706 We'll have an in-class, written midterm exam. 892 00:55:10,706 --> 00:55:12,511 And, a large portion of your grade 893 00:55:12,511 --> 00:55:15,056 will be the final course project, where you'll work 894 00:55:15,056 --> 00:55:17,407 in teams of one to three and produce 895 00:55:17,407 --> 00:55:20,514 some amazing project that will blow everyone's minds. 896 00:55:20,514 --> 00:55:23,871 We have a late policy, so you have seven late days 897 00:55:23,871 --> 00:55:26,380 that you're free to allocate among your different homeworks. 898 00:55:26,380 --> 00:55:29,549 These are meant to cover things like minor illnesses 899 00:55:29,549 --> 00:55:34,204 or traveling or conferences or anything like that. 900 00:55:34,204 --> 00:55:36,188 If you come to us at the end of the quarter 901 00:55:36,188 --> 00:55:38,757 and say, "I suddenly have to give a presentation 902 00:55:38,757 --> 00:55:39,971 "at this conference," 903 00:55:39,971 --> 00:55:40,880 that's not going to be okay. 904 00:55:40,880 --> 00:55:42,624 That's what your late days are for. 905 00:55:42,624 --> 00:55:44,111 That being said, if you have some 906 00:55:44,111 --> 00:55:46,643 very extenuating circumstances, then do feel free 907 00:55:46,643 --> 00:55:48,705 to email the course staff about those extreme 908 00:55:48,705 --> 00:55:50,295 circumstances. 909 00:55:50,295 --> 00:55:52,404 Finally, I want to make a note 910 00:55:52,404 --> 00:55:54,177 about the collaboration policy. 911 00:55:54,177 --> 00:55:55,921 As Stanford students, you should all be aware 912 00:55:55,921 --> 00:55:58,389 of the honor code that governs the way 913 00:55:58,389 --> 00:56:00,785 that you should be collaborating and working together, 914 00:56:00,785 --> 00:56:03,609 and we take this very seriously. 915 00:56:03,609 --> 00:56:05,635 We encourage you to think very carefully 916 00:56:05,635 --> 00:56:07,620 about how you're collaborating and to make sure 917 00:56:07,620 --> 00:56:11,037 it's within the bounds of the honor code. 918 00:56:12,304 --> 00:56:14,378 So in terms of prerequisites, I think the most important 919 00:56:14,378 --> 00:56:17,492 is probably a deep familiarity with Python, 920 00:56:17,492 --> 00:56:20,081 because all of the programming assignments 921 00:56:20,081 --> 00:56:22,339 will be in Python. 922 00:56:22,339 --> 00:56:26,066 Some familiarity with C or C++ would also be useful.
923 00:56:26,066 --> 00:56:29,354 You will probably not be writing any C or C++ 924 00:56:29,354 --> 00:56:31,705 in this course, but as you're browsing through the source 925 00:56:31,705 --> 00:56:33,676 code of these various software packages, 926 00:56:33,676 --> 00:56:35,922 being able to at least read C++ code 927 00:56:35,922 --> 00:56:39,879 is very useful for understanding how these packages work. 928 00:56:39,879 --> 00:56:42,439 We also assume that you know what calculus is, 929 00:56:42,439 --> 00:56:44,971 that you know how to take derivatives, all that sort of stuff. 930 00:56:44,971 --> 00:56:46,533 We assume some linear algebra: 931 00:56:46,533 --> 00:56:47,879 that you know what matrices are 932 00:56:47,879 --> 00:56:52,072 and how to multiply them and stuff like that. 933 00:56:52,072 --> 00:56:53,660 We can't be teaching you how to take 934 00:56:53,660 --> 00:56:55,691 derivatives and things like that. 935 00:56:55,691 --> 00:56:57,321 We also assume a little bit of knowledge 936 00:56:57,321 --> 00:56:59,821 of computer vision coming in, maybe at the level 937 00:56:59,821 --> 00:57:01,238 of CS131 or CS231A. 938 00:57:02,367 --> 00:57:03,923 If you have taken those courses before, 939 00:57:03,923 --> 00:57:05,120 you'll be fine. 940 00:57:05,120 --> 00:57:07,347 If you haven't, I think you'll be okay in this class, 941 00:57:07,347 --> 00:57:09,853 but you might have a tiny bit of catching up to do. 942 00:57:09,853 --> 00:57:11,550 But, I think you'll probably be okay. 943 00:57:11,550 --> 00:57:13,704 Those are not super strict prerequisites. 944 00:57:13,704 --> 00:57:16,964 We also assume a little bit of background knowledge 945 00:57:16,964 --> 00:57:20,540 about machine learning, maybe at the level of CS229. 946 00:57:20,540 --> 00:57:23,556 But again, the really important, key, fundamental 947 00:57:23,556 --> 00:57:25,723 machine learning concepts we'll reintroduce 948 00:57:25,723 --> 00:57:27,755 as they come up and become important. 949 00:57:27,755 --> 00:57:29,916 But, that being said, a familiarity with these things 950 00:57:29,916 --> 00:57:32,416 will be helpful going forward. 951 00:57:34,774 --> 00:57:36,046 So we have a course website. 952 00:57:36,046 --> 00:57:36,950 Go check it out. 953 00:57:36,950 --> 00:57:38,303 There's a lot of information and links 954 00:57:38,303 --> 00:57:39,742 and the syllabus and all that. 955 00:57:39,742 --> 00:57:43,656 I think that's all I really want to cover today. 956 00:57:43,656 --> 00:57:46,157 And, then later this week, on Thursday, 957 00:57:46,157 --> 00:57:48,733 we'll really dive into our first learning algorithm 958 00:57:48,733 --> 00:00:00,000 and start diving into the details of these things.